.s HAND/EYE SYSTEMS

\\pers Thomas O. Binford, Lynn H. Quam, David Grossman,
Robert C. Bolles, Donald Gennery, Raphael Finkel, Kicha
Ganapathy, Hans P. Moravec, Russell H. Taylor, Victor D. Scheinman,
Yefim Schukin, Bruce E. Shimano, Kurt Widdoes. 

The main  scientific goal of  the hand/eye  project is to  understand
those faculties of machine intelligence which involve interaction
with the real world through perceptual and motor functions.   Our aim
is  to  design,   build,  and  test  machines   with  perception  and
manipulation.

Perception  and   motor  functions  are   intelligent
functions in  that  they  require many  of  the same  mechanisms  and
representations as other  areas of intelligence.  The primary problem
in all these areas is to bring to bear knowledge at all levels in the
system, through a world model.  This is done by representing
specific domains in detail, with description mechanisms dictated by
the representations.  Goal-directed visual systems are a current
topic  of  great  interest.  They  require  a  planning  or  strategy
component,  whose  power  comes  from  use  of  knowledge and  models
specific to  vision.

Perception  and motor  control  are unique  in
having  an  immediate  need  for  representing  shape  and  geometry,
geometrical  operations,  and  representing  control  structures  for
geometry.  All areas of artificial intelligence converge on the use of
representation  of  various  aspects  of  world  knowledge.   Natural
language will  eventually  require representations  of  geometry  for
everyday  notions  of spatial  relations,  physics  and the  physical
world.       Automatic    programming   will    require   geometrical
representations as soon  as it begins to  deal with geometry, or  the
physical world.

As in speech  understanding, descriptive mechanisms
of vision function in a world with noise and uncertainty, and we must
define the similarity  of symbolic descriptions at various  levels of
detail,  perturbed  by noise  and distorted  by  transformations like
perspective.  Speech, however, is a set of conventions designed to be
understood, in spite of noise.  The world did not design itself to be
seen. 

.oi
Our effort will concentrate on (1) how to represent visual quantities
and  operations in  machines, and  (2) how they  are used  to program
classes of tasks,  not just  isolated tasks.   We have  not made  any
mention of  what part  actual sensors and  manipulators play  in this
research.  We need to test our ideas and algorithms by experiments in
order to develop adequate representations and theories.  We  can do a
lot  of useful  modeling on  an <a priori> basis.   However,  we don't
expect to model  the entire  world.  It  is often  more difficult  to
model the world than  to use the world as a  model.  Particularly for
our    applications,   we   need   carefully    chosen   real   world
experimentation. 

The  application  goals  of  the  group  are   to  design  and  build
programming   systems  for  vision  and   manipulation,  for  use  in
industrial  assembly,   inspection,  assembly   and  maintenance   in
hazardous  environments,   handling  toxic  materials,   and  vehicle
navigation.    Our intent  in  assembly is  to make  it  possible for
production engineers without expertise in computer science  to set up
and program assemblies  in a relatively short time.   That ability is
particularly important  for products  made in  short production  runs
with frequent changes, e.g. for airframes or replacement parts.  Spares
could be produced on demand by stockpiling tapes instead of hardware.

For batch production, the facilities  of the assembly system must  be
more  powerful than  for high  volume  production.   Since much  less
design effort  can go into special jigs and fixtures, the system must
be much more versatile to justify the investment, by  paying off over
several product  runs.  Vision can be  important for batch production
where versatility and easy interfacing are important.  Facilities for
setup  which   use  vision,  such   as  simplified   calibration  and
self-calibration, have economic importance for short production runs.
The system is intended to make use of design data from computer-aided
design.  These techniques are also useful in assembly and maintenance
in hazardous environments, even as aids to teleoperator devices.

Related techniques might
be applied  to  guidance  of remotely  piloted  aircraft  in  hostile
environments.   This  approach could  be regarded  as an  intelligent
man-machine interface.   The system which we  build up for the domain
of assembly contains the basis for other application domains. 

We have  carried out  specific tasks  illustrating general issues  in
building research systems  which are models for practical systems for
industrial assembly.    We  have found  this  strategy  defines sharp
research questions.   These  issues are:  representations of physical
constraints and solutions which allow the system to keep track of the
envelope of  possible object positions;  representations which  allow
obstacle  avoidance to  be  carried out  in  planning and  execution;
representing visual  features  and  operators  and  expressing  these
representations as a vision language;  coarse descriptions abstracted
from  object  descriptions to  be  used in  recognition  to  select a
subclass of similar objects to match against.

Our  system  merits  support  because  of   the  integration  of  our
manipulation work with vision, because of the planning model and very
high  level  language   system,  because   of  our  advanced   object
representation facilities, and because of  the extent of our progress
in  the past in complete systems for  manipulation.  Our subgoals are
as follows:
.bs;ib; preface 1;fac;
(1) To  design and  build a programming system for manipulation, implement
touch and force sensors,
implement control  algorithms  for   touch,  force  control  and
cooperating manipulators. As a test of that system, to assemble a two
stroke gasoline engine.  Although we have done major subassemblies of
this task,  we think the problem is still difficult enough to test our new
sensing and  control facilities and  representation capabilities;  in
particular, we  will couple in visual  control.  We  are still in the
midst of  system  implementation and we are  some  months  away  from  task
execution.  At that time we may choose another task. 

(2) To  provide a  very high  level language  system with a  knowledge
base  for  programming  the manipulator  system  and  to apply  that
knowledge  base   to   translate  programs   written  in   terms   of
assembly-oriented  primitives  into runnable  manipulator  programs.  
This is a prime scientific goal because of the focus it provides for
studying basic representational issues in robotics, and it has great
potential value for increasing the cost-effectiveness of manipulator
programming.  These aspects are discussed more fully below. 

(3) To provide a vision language and a visual feedback system which
can be programmed easily to locate features in  scenes where a map is
available.    To   apply  the  system  to  simplify  programming  the
determination of  positioning error  for real  parts being  assembled
(screw  in hole,  sleeve over  shaft,  aligning two  surfaces).   The
resulting  program must execute  in about  1 second.   Objects may be
curved, shiny, dirty,  with no special  painting or preparation.   We
assume variation of less than  1 cm. in part location, and only small
rotations.  The system is intended to be programmed by people without
experience in  computer vision.   A  planning program will  interface
between the user's model and detailed vision programs. 
.end

.ss Manipulation

%2A man can do a lot blind, but  much  less  with  one  arm,  only  two
fingers,  and  poor  sense  of  force and touch.  We have reached the
limit of what we can do with crude touch and force sensing.%1

.cb Accomplishments
We have  built  up a  position of  leadership in  computer-controlled
manipulation  by   a  coherent  program  of  software,  hardware  and
experimentation:  We   have   the  most   complete   and   integrated
computer-controlled manipulation system.  It is currently no harder,
and in many cases easier, to program our computer-controlled
manipulators than to perform the same tasks with conventional
manipulators in "record and playback" mode.  We designed and built
high quality manipulators [Scheinman], duplicates of which have  been
built  or  bought  by JPL,  SRI, GM, National  Bureau  of  Standards,
University  of Illinois,  Purdue  University, and  Boston University.
Scheinman designed  a small scale  version of  the arm  while he  was
visiting at MIT.  A company has  been formed which has 8 of these arms
completed or
in   production,   for   Texas   Instruments,   MIT,   University  of
Illinois, SRI, and Purdue University.

We were  first to automatically
generate trajectory control with  software servo [Paul].  Others used
point to point motion with  hardware servo.  Our trajectory  planning
had the  advantage that it  was easy to  generate smooth  motions for
complete actions.  For comparison, it was easier for us to program
the trajectories in our assemblies by "learning by doing" than to
program similar assemblies with Unimates in "record and playback" mode. 
The arm  servo contained several new features: a predictive Newtonian
dynamic model  of  the arm,  including  inertia and  gravity  forces;
feedback   as   a  critically   damped   linear  system;   trajectory
modification which allows self-adaptation of planned trajectories  to
accommodate variations in position and contingencies.  These were
embedded in  a manipulation system called  WAVE, an interpretive hand
language, split  into a  small  arm servo  program written  in  PDP-6
assembly language, and a trajectory  planning program written in SAIL
for the  PDP-10.  WAVE was not intended to be exportable, but SAIL is
available on PDP-10 systems.
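
The flavor of the servo computation can be suggested by a small
sketch (modern Python notation, invented purely for illustration; the
actual WAVE servo ran in PDP-6 assembly language with a full dynamic
model of the arm): a single joint driven as a critically damped
linear system, with a gravity feedforward term standing in for the
predictive Newtonian model.

.bc;verbatim;
import math

def servo_step(theta, omega, target, inertia, gravity_torque, kp, dt):
    """One update of a single-joint servo (toy units)."""
    kd = 2.0 * math.sqrt(kp * inertia)           # critical damping condition
    torque = kp * (target - theta) - kd * omega + gravity_torque
    alpha = (torque - gravity_torque) / inertia  # net acceleration after gravity
    omega += alpha * dt
    theta += omega * dt
    return theta, omega

theta, omega = 0.0, 0.0
for _ in range(2000):                            # two simulated seconds at 1 ms
    theta, omega = servo_step(theta, omega, target=1.0, inertia=0.5,
                              gravity_torque=2.0, kp=50.0, dt=0.001)
print(round(theta, 3))                           # approaches 1.0 with no overshoot
.end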

The planning and servo techniques apply
to the class  of arms for which  analytic or  numerical
solutions exist. 
At present, solutions do not exist for arms with redundant degrees of freedom.
 Use of the language WAVE made possible a degree  of
generality; we evolved  a library of  macros for assembly  which made
successive  tasks increasingly quick to  program.

We  were the first
group to perform computer-controlled assembly, as a series of planned
experiments  to test control  facilities.  An  automobile water pump,
piston-crank subassembly  and  clutch  subassembly of  a  two  stroke
gasoline engine, and tool changing  were programmed. 
At about that time, the University of Edinburgh programmed assembly of
a toy car [Ambler].
Since that time,
Kawasaki and Unimation performed assemblies using
"record and playback"  mode.  IBM has  assembled a toy and  a complex
subassembly   of  a   typewriter, which is probably   the   most  complicated
computer-controlled assembly.   Hitachi  has made  a special  purpose
assembly device using force feedback [Goto]  which is in production.
Inoue at MIT  has performed an assembly using force feedback [Inoue].

We have used crude force feedback from measurement of  motor torques,
crude  touch  sensing, and  searching  as  follows: to  increase  the
tolerance of assemblies; to require only crude position and alignment
information and  self-correct the  estimates (to  simplify setup  and
programming);  to allow contingency  action; to  continuously monitor
positions  and   correct   for   drifts  in   calibration.      These
self-calibration and  self-alignment facilities are  shown in  a film
we have produced [3].

We have programmed
synchronized, non-simultaneous manipulators;  we have recently  found
that a  Japanese group  [Nakano] programmed  simultaneous coordinated
motion  of  two  manipulators  at  about  the  same  time.   We  have
implemented a force balance with a sensitivity of about  25 grams, an
order of  magnitude better  than current force  sensing.  We  have an
experimental  touch  sensor  with  1  gm  sensitivity,  an  order  of
magnitude better than the  current touch sensor.  Both  these sensors
need to be extensively evaluated and interfaced. 

.cb Language for Programmable Assembly

Paul,  Finkel,  Taylor,  and  Bolles  have  designed a  language  for
programmable assembly, AL [Finkel], to succeed WAVE.  AL  is intended
as a  research vehicle and is designed to be  modified.  It is
written  in  SAIL and  is  fairly  transportable.
We believe that any  language should
be completely implemented in  a public version after it is developed.
The manipulation portion of AL will be fixed and final by July  1977.
The planning portion of AL will be  well-developed at that stage, but
developments  in strategy and  planning systems will  be desirable to
include in AL  as they occur.

WAVE was a  language at the  assembly
language level.  AL has a  variety of structures which would be quite
difficult  to put into WAVE.  AL  has an ALGOL-like control structure
to allow structured  programs.  Multiple  processes are provided  for
simultaneous control of  several devices.  Interrupts are implemented
by ON-monitors.  Trajectories are specified with greater flexibility,
and more versatile force control will be included.

A planning system is  included, as described  in  detail below.
Briefly, it provides for specifying planning values of  positions and
attachment relations, and  keeps track of planning  values as control
passes  through  control  structures.   The  language  is potentially
useful for  a broad class  of devices  by including special  solution
programs for each device.  Currently, the class of manipulators
handled is those devices with non-redundant degrees of freedom.  It
probably is not
difficult to  provide solution  programs for  devices with  redundant
degrees of freedom, by specifying constraints on the motion. 


.cb Sensing for Manipulation
We have found no force and touch sensors currently available which are
adequate for our  purposes.  We have made extensive surveys [Binford]
and publicized our requirements to interest commercial development of
sensors, and  to promote  cooperative development.   If there were any
which  were  roughly adequate,  we would  use  them in  preference to
developing our own.   If at  any time we  find such devices, we  will
terminate our own efforts  at sensor development as soon as possible.

At any rate, we  will probably need  to develop computer  interfaces,
which are  not trivial  in this case.   We  need to develop  improved
force  and touch  sensing hardware.    We will  need to  evaluate and
interface sensors which we have developed.  If  difficulties develop,
it may  be necessary  to develop  alternate touch  sensing techniques
which  are simpler to  interface.  We  have made a study  of a driven
piezoelectric touch  sensor which is  promising but requires  further
development.  We will  implement force sensors with sensitivity of 20
grams.  We will implement touch  sensors with sensitivity of 2  grams
on a 2x3 matrix for  each finger.  Those are state of  the art under
the constraints  which make them usable on  a manipulator.

What will
we be able to do with such sensing that we could not  do without?   Touch
can be quite  sensitive, but is unusable with tools  where contact is
at  the tool  and not at  the fingers.   Force sensing  operates at a
distance.   We  can  extend arm  control  to delicate  and  dexterous
operations; to adaptive grasping  of irregular objects; to many small
assemblies which are easily damaged; to picking up light objects
which previously would move away; to inserting a sleeve over a shaft
without  binding; to  exploring with touch;  and to  inserting screws
into holes by tipping them and feeling when they drop into the hole. 
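
The screw example amounts to a "guarded move", and the following
sketch suggests its control structure (Python notation; the arm
interface move, tip, and read_force is hypothetical, and a toy
simulated arm stands in for real hardware so the sketch runs as
written):

.bc;verbatim;
class FakeArm:
    """Toy stand-in for the manipulator; entirely invented."""
    def __init__(self):
        self.z = 5.0                 # mm above the surface
        self.tipped = False
    def move(self, dz):
        self.z += dz
    def tip(self, degrees):
        self.tipped = True
    def read_force(self):            # grams, as a force sensor might report
        if self.z > 0:
            return 0.0               # still descending in free space
        return 5.0 if self.tipped else 40.0  # force drops once tip enters hole

CONTACT_GRAMS = 20.0                 # near the proposed sensor sensitivity

def insert_until_seated(arm, max_steps=200, step_mm=0.2):
    for _ in range(max_steps):
        arm.move(dz=-step_mm)                   # lower a small increment
        if arm.read_force() > CONTACT_GRAMS:    # felt the surface
            arm.tip(degrees=3)                  # tip the screw slightly
            if arm.read_force() < CONTACT_GRAMS / 2:
                return True                     # force dropped: in the hole
    return False                                # contingency: never seated

print(insert_until_seated(FakeArm()))           # True
.end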

Previously, we have used force sensing  based on motor torques with a
sensitivity  of  500  grams,  and touch  sensing  based  on  a single
microswitch per finger  with sensitivity of  10 grams.  The  previous
sensing   abilities  were   crude  enough   to  strongly   limit  our
manipulation abilities. 

.cb Cooperative Control of Manipulators
We  will  formulate   a  theoretical  basis  for  force   control  of
manipulators, determine  how sensitive force control  can be with our
manipulators [Scheinman], optimize those abilities, provide  language
primitives, and implement them.  No adequate analysis of force
control of manipulators exists.  Whitney has written about the
subject, but without a solution. 

We have formulated new force control and synchronization
primitives in AL, but we require experimentation to evaluate and
improve them.  We will analyze and experiment with
cooperative control  of two  manipulators, to  design and  implement
language  primitives for AL.   Although there is  a Japanese paper  on
the subject, it is only a  beginning and much more remains to  be done.

We   need  coordinated  manipulation   as  a  basis   for  other
manipulation   experiments.   Typical  tasks  requiring   cooperative
control are  installing limp  or semi-rigid  gaskets, carrying  heavy
objects,  picking  up  irregular  objects, and carrying liquid  in  open
containers.   We  will carry  out  tasks which  require  two  arms in
cooperation, sensitive touch, and force, from among the tasks in this
and the preceding paragraph, and in integrated assembly tasks.

.cb "`Very High Level' language for automation"

One very  important factor  in determining  how widely the  potential
advantages of programmable manipulator systems can be realized is the
ease with which the necessary programming can be done  by engineering
personnel who are not necessarily expert computer scientists. 

To  date,  languages  for  control of  manipulators  have  been  very
explicit, requiring:

.bs; preface 1;fac;
(1) very  detailed descriptions  of the  specific motions  and
sensor tests to be made. 

(2) a great deal of bookkeeping by the user, to keep track of
expected positions of objects and of what calculations must be
made  to update  position  variables as  a  result of  sensory
tests, and the like. 

(3) an intimate understanding of the manipulation system. 

(4) a fairly high degree of programming sophistication. 
.end

AL seeks to minimize,  in so far as possible, the  burdens that these
characteristics place on the user.  For instance, it provides a means
by which one variable may be "affixed" to another, so that if  one is
changed  then   the  other   will  be  updated   appropriately,  thus
substantially reducing the amount of explicit bookkeeping required. 
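
The following sketch suggests the bookkeeping that affixment replaces
(Python notation, invented for illustration; AL's real mechanism
operates on coordinate frames and transforms, not bare position
vectors as here):

.bc;verbatim;
class Frame:
    """A location variable; affixed frames are updated when this one moves."""
    def __init__(self, pos):
        self.pos = list(pos)
        self.attached = []                     # (frame, fixed offset) pairs

    def affix(self, other):
        offset = [o - s for s, o in zip(self.pos, other.pos)]
        self.attached.append((other, offset))

    def move_to(self, new_pos):
        self.pos = list(new_pos)
        for frame, offset in self.attached:    # propagate to affixed frames
            frame.move_to([p + d for p, d in zip(self.pos, offset)])

head = Frame([0.0, 0.0, 0.0])
hole1 = Frame([2.0, 1.0, 0.0])
head.affix(hole1)                              # hole1 now rides with head
head.move_to([10.0, 0.0, 0.0])
print(hole1.pos)                               # [12.0, 1.0, 0.0], updated automatically
.end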

Despite such  niceties, writing code at the manipulator control level
still requires a  fairly high degree of  programming sophistication. 
Typically, the gross structure  of programs is fairly simple, and may
be described by a sequence or partial ordering of fairly well defined
subtasks.    The  "fine"  structures,   however,  are  somewhat  more
complicated.  Typically,  there  may be  loops,  tests  for exception
conditions, specifications  for  force or  tactile feedback,  and  so
forth.  

The point here is that many users do not particularly care about such
details.  It  would be  much  easier  for them  to  specify  tasks at
somewhat higher  levels of  abstraction.   For instance,  an assembly
engineer who wants to put together a small rotary pump should be able
to write something like:

.bc;verbatim;
    :
    COMMENT This is high-level AL code.;

    FIT pumphead ONTO pumphousing
        WITH ALIGNMENT 
           housing.studx IN head.holex
           housing.study IN head.holey;
    INSERT bolt1 INTO  head.hole1
        WITH TORQUE = 10*FT*LB
        USING TOOL driver;
    INSERT bolt2 INTO  head.hole2
        WITH TORQUE = 10*FT*LB
        USING TOOL driver;
    INSERT drainplug INTO sidehole;

    COMMENT and so forth;
    :
.end

and allow the system to fill in the  details, rather than coding them
himself.  Such a 
code sequence can easily  be written in a few minutes,  whereas
the  corresponding   manipulator  program  takes  even   an  "expert"
programmer several hours to write and debug.  Furthermore, unless our
hypothetical engineer  is an above-average  hand-eye programmer,  the
result is more apt to be what he wants. 

Essentially, the  research  question posed  here is  "How can  expert
knowledge  about hand-eye  programming be codified  into a  system so
that it can be accessible to a non-expert user?"

In addition  to having  great "applications"  importance, we  believe
that work in this area provides  a useful framework for research on a
number of related problems.  These include:

.bs;preface 1;fac;
(1) The  representation   of   information  about   physical
situations and  description of objects in  a form conducive to
reasoning about them. 

(2) Characterization of  how sensory information  affects what
you  know about  a  physical situation  and how  a  given fact
affects how accurate your knowledge can be assumed to be. 

(3) Codification   of   knowledge   about   techniques   for
accomplishing  particular subtasks.   Such  knowledge includes
what restrictions  must be placed  on object  locations for  a
technique to work, what is accomplished, how much error can be
tolerated, what extra information may be gained as a result of
using it, an outline or program skeleton giving the basic code
required, and so forth. 

(4) Understanding  of  how different  parts of  a program  can
affect each other and of how to "fill in" details for one part
in a manner consistent with the requirements of other parts. 
.end

We have chosen small  scale mechanical assembly as a  good domain for
investigating  the incorporation  of such specialized  knowledge into
AL.  There are a number of reasons why this is an attractive choice. 

.bs;preface 1;fac;
(1) The situations one encounters in  the domain are generally
fairly  constrained,  thus  simplifying  somewhat  the  burden
placed on the modelling and planning systems. 
	
(2) The use  of sensory feedback techniques  can significantly
reduce  the  requirements  for expensive  fixtures.    Thus, a
system that helps plan the use of such techniques has a "live"
application. 

(3) It is possible to describe interesting and useful tasks in
the  assembly   domain  with  a  relatively  small  number  of
"primitive" operations. 

(4) Most of the underlying mechanisms are general enough to be
transferred to other manipulatory domains. 
.end

The initial system will have three basic assembly-oriented primitives
(insertion  of shafts and screws  into holes, fitting  nuts & washers
over shafts, and mating surfaces  of two objects according to  simple
alignment  specifications), together  with a  small set  of "service"
primitives  like "pick up" and "place".   These suffice to describe  a
surprisingly large class of tasks. 

One  obvious   method  of   providing  convenient  task   description
formalisms  is  to  combine commonly  occurring  code  sequences into
"macro operations" and then allow the user to write programs in terms
of  those  operations.    Unfortunately,  however,  this  solution  is
inadequate for the assembly domain.  Frequently there are a number of
ways to  do  a  particular subtask.  Which  is "right"  depends  very
largely  upon what other  subtasks must  be done.   Similarly,  it is
frequently possible to perform part of one subtask (or, at least,  to
gather useful  information) in  the course  of doing  another.   Such
considerations  are in general  very difficult to  express within the
paradigm of macro expansion. 

We  will  use  progressive  refinement  to  produce   consistent  and
reasonably  efficient programs.    Here, the  user's initial  program
specification  is rewritten into successively more detailed versions,
until an  executable  program written  in terms  of  the "low  level"
motion  and sensor  control statements  is produced.    The principal
advantage  of  such  a  breadth-first  approach  is  that  it  allows
individual decisions to be  made within the context of  other tasks. 
[Sacerdoti] uses a  somewhat similar approach in planning for the SRI
computer-based consultant  system.   Aside from  a  number of  fairly
substantial  differences in  how the  basic paradigm  is implemented,
this work differs from his in the level of detail under consideration
and in  that  the SRI  problem solver  is seeking  to  guide a  human
repairman,  whereas AL is  producing a  computer program that  runs a
manipulator. 

AL's   implementation   of   progressive   refinement   relies   upon
inter-process communication mechanisms to help  assure that decisions
are made compatibly.  Briefly, this works as follows: knowledge about
the assembly primitives  and the  various manipulation statements  is
encoded into procedures within the system.  Then, with each statement
in a  program graph, the system associates a process instantiation of
the appropriate procedure.  Each process then becomes responsible for
keeping the system's  model of its expected effects  up to date.  The
processes associated with simple  motion statements have very  little
else  to  do.  The "higher  level"  operations  are  responsible  for
suggesting  further elaborations  of  themselves (into  more detailed
substructures) and for evaluating the effects of changes  proposed by
other processes upon their own execution. 
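
The rewriting skeleton of progressive refinement can be suggested by
a small sketch (Python notation; the expansion rules shown are
invented examples, and the inter-process negotiation that
distinguishes AL's scheme from simple macro expansion is omitted
here):

.bc;verbatim;
RULES = {
    "insert(bolt1, hole1)": ["pickup(bolt1)", "position_over(hole1)",
                             "guarded_lower(bolt1)", "drive(bolt1)"],
    "pickup(bolt1)":        ["open_hand()", "move_to(bolt1)", "close_hand()"],
}

def refine(program):
    """Rewrite breadth-first until every statement is primitive."""
    while any(stmt in RULES for stmt in program):
        program = [sub for stmt in program
                       for sub in RULES.get(stmt, [stmt])]
    return program

print(refine(["insert(bolt1, hole1)"]))
.end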

.cb The planning model

The planning model is  one of the central features of  the AL design,
and in many ways is analogous to the sort of bookkeeping done by an
algebraic compiler, which  must keep  track of register  assignments,
temporary  variables,  and  similar information.    Essentially,  the
system uses  its "understanding" of the semantics of AL statements to
maintain a data base  containing information about each point  in the
program graph.  This data  base is used at all levels of the planning
system.  For low level AL, the principal use of the planning model is
to keep track of expected values of  variables, especially those used
to hold coordinate  frames.  These planning values, in turn, are used
in preparing motion trajectories. 

At higher  levels, the  planning model  provides the essential  basis
upon  which  the  system  can  base  decisions  on how  to  translate
task-oriented statements into the appropriate motion primitives.
In this  case, the system  needs a  much better understanding  of the
expected  state  of  the  "world"  at  each  point  in the  program.  
Frequently, this information is most conveniently  specified in terms
of  semantic  relations  between  objects.   For  instance,  one  can
describe the location of a cup  by saying that it is sitting  upright
in some region on a table.  This information may then be reflected in
a  set of mathematical constraints  on the location  variables of the
cup.  Such  constraints may  then be  solved to give  an estimate  of
possible locations,  and a similar  analysis can be used  to estimate
likely  error values.  In AL, we will  make extensive use of both the
symbolic and the mathematical  form of constraint relations  in order
to help in forming plans. 
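
A sketch of the idea (Python notation; the relation, the numbers, and
the interval representation of constraints are all invented for
illustration):

.bc;verbatim;
def on_table(region):
    """Translate "sitting upright somewhere in this region" into intervals."""
    (x0, x1), (y0, y1), table_z = region
    return {"x": (x0, x1), "y": (y0, y1),
            "z": (table_z, table_z),           # resting on the surface
            "tilt": (0.0, 0.0)}                # upright: no rotation range

def intersect(c1, c2):
    """Conjoin two constraint sets by intersecting their intervals."""
    out = {}
    for var in sorted(set(c1) | set(c2)):
        lo1, hi1 = c1.get(var, (-1e9, 1e9))
        lo2, hi2 = c2.get(var, (-1e9, 1e9))
        out[var] = (max(lo1, lo2), min(hi1, hi2))
    return out

cup = on_table(((0.3, 0.6), (0.0, 0.4), 0.75))
seen = {"x": (0.44, 0.48), "y": (0.1, 0.2)}    # a coarse visual fix, say
print(intersect(cup, seen))                    # the tightened location envelope
.end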

.cb Accomplishments

The high-level language constructs for AL have been designed and used
to describe several  sample assembly tasks including the assembly of
a simple water pump, assembly of a metal box, attachment of a bracket
to a beam, and similar tasks.   The task-oriented primitives have all
been   described   in   terms  of   manipulator   actions.     Object
representations have been developed and debugged,  as have procedures
for  deriving the  accuracy prerequisites  of one  of  the primitives
(insertion of a shaft into a hole) from the parts descriptions. 

A preliminary version of the problem solving paradigm was written and
debugged during the early design phases of  AL.  It generated outline
plans  for the pump assembly  task from partial  orders of high-level
operations, and selected workpiece positions and subtask orderings so
as to minimize superfluous repositioning of the workpiece and
unnecessary returns of a tool to its rack. 

The code for maintaining facts in the AL planning model and for
propagating them across control structures has been written and
debugged, as  have the  procedures for  calculating  and storing  the
planning values used by the "low level" manipulator statements.  Beyond
this,   the   compile-time  expression   evaluation   and  conditional
compilation facilities of AL have been substantially debugged. 

A system has been developed to automate the translation of semantic
relations between objects into the corresponding mathematical
constraints, and to use the latter to produce range estimates for
location and  error variables.   The components  of this system  have
been run independently and the whole thing is being incorporated into
the AL planning  model.   We expect  to use it  quite extensively  in
planning  both  for  manipulation  tasks and  for  visual  feedback.  
[Ambler   and  Popplestone]   follow  a  very   similar  analysis  in
translating relations between objects into  mathematical constraints,
which they then solve algebraically.  They do not, however, deal with
ranges of values or with error estimates, and generally make less use
of the semantic relations themselves. 

.cb Milestones 

An initial version of AL  will be operating by January
1976.  A  complete version will be operational by January 1977.  That
version will include cooperative motion of two arms, force  and touch
sensors, the parser,  planning model, and user  environment.  A force
sensor  with 25 grams  sensitivity will  be operational on  an arm by
January 1976.  Touch sensors with 2 gm sensitivity  in a small matrix
of  2x3 will  be  operational on  one hand  by  July 1976.    We will
complete experiments  analyzing  texture  and shape  with  touch  and
formulate language primitives by January 1977.   We will complete the
theoretical  analysis,  experimentation and  formulation  of language
primitives for force control by July 1976. 

.begin "chart"
.area text lines 4 to 50;
.place text
.next page;
.nofill; select 5;
.narrow 5;
  		July 1975 to Nov 1976
1975			   1976			         1976
July			    Jan				 July
|   |    |    |    |    |    |    |    |    |    |    |    |    |    |    |    |
*                            |                             |
---AL prelim version--------→|------------final version------------------------→
							
         ----user system------------------------------------------------------→|

*

---planning system----------→|------more primitives,error recovery------------→|
---collision detection------------→|

*

---force sensing on arm----------→|
--analysis of force-----------------------→|

*-touch sensor expts------→|--operating on arm------------→|
                            -----texture expts------------→|
						   --expts shape by touch-----→|


*----arm 2----------------→|--cooperating arms expts-------→
  ARM INTERFACE



*---planning system----------→|--------3d models-----------→|----stereo--------→|
  VISUAL FEEDBACK


-------design---→|  VISION LANGUAGE
*                -----------implement 2nd version-------------------→|

--------optimal curve--------------------------→|

.end "chart";
.area text lines 4 to 53 IN 2 COLUMNS 5 APART
.ssname←NULL; next page;
.place text
.ss Vision
.cb Overview
We believe that  our program addresses  most of the  major scientific
problems  in machine perception:
.bc;
	representation of shape
	stereo descriptive mechanisms
	texture descriptive mechanisms
	programming classes of visual problems
	using a large visual memory
	what can be done with large computational power
	goal-directed top-down verification
	strategies with shape descriptors.
.end

We  pose several  questions:

(1) What tasks can be done with current
computing capacity?  PDP-10 class machines have about a million times
less computing power than the human visual system (Hubel-Wiesel and
stereo cells).   By choosing our problems carefully, we  can do a lot
with  current computing  power; the  visual feedback  tasks described
below are  good examples.   And we  can investigate  algorithms which
might require extensive computation. 

(2) What can be done with a million  times current computing capacity and
large memory?  Special purpose computers well within the state of the
art can gain a  factor of 1000.   CCD signal processing elements  and
revolutionary advances  in semi-conductor technology  promise further
gains.   For example, Noyce of Intel states that the density of gates
on a chip increases a factor  of two each year.  If we assume  only a
factor  of two every  two years,  in twenty years  we would  gain the
other factor of 1000, or in 15 years with a factor of 10 increase  in
logic speed.   If we had such  machines, we could make  better use of
current techniques  which we avoid now  because of computation cost. 
That would not solve our  problems; there would still be many  things
we don't  know how  to do,  and we could  not make  full use  of that
processing  power.    We  are  evaluating  CCD  signal processors  and
algorithms suitable for that technology.  We  should not develop that
hardware;  it  is  being developed  by  others.    We can  offer  some
guidance on  what is  worthwhile to  do in  hardware.   We  should be
prepared to understand algorithms, even those which are prohibitively
costly in computation by current standards. 

We consider two environments, programmable  assembly and road scenes,
because they offer a range of complexity (e.g. shadows, shapes,
motion,  etc.) and  a  certain  amount  of  structure  which  can  be
capitalized upon  in order  to solve various  tasks.   Typical visual
feedback tasks within these environments are:

.bs;
	(1) visually servo a bearing over a shaft
.begin indent 4,7;
		(a) visually locate the shaft (i.e. determine its angle
		     and the position of the end)

		(b) visually locate the bearing with respect to the
		     end of the shaft
.end
	(2) visually check to see if there is a screw on the end of
	    the screwdriver

	(3) locate a bolt that has been accidentally dropped into an engine
	    casing

	(4) inspect the paint job on a car

	(5) navigate a vehicle from one point to another along a road.

	(6) determine what the object in the middle of the road is
	    (a box, a dog, a child, etc.).
.end

Our research  in visual feedback  differs from  other work in  having
more  adequate representation  of  shape, stronger  shape descriptive
mechanisms (curve descriptors) and making much stronger use of shape.
SRI uses a  strategy system which finds objects on  the basis of very
local  properties, such as color and  range data.  Consider the  three
subsystems:  acquisition  of  candidates  for  the   desired  object;
verification of  detailed match of candidates to  the desired object;
automatic plan  generation.   Our  acquisition  procedures  use shape
information, not just local properties.   Similarly, our verification
procedures and  strategy generation have better shape descriptors and
better shape representations. 

.cb Representation
We have worked  extensively on representation for shape  of objects. 
Our descriptive techniques for complex objects were determined by our
representation [Nevatia,  Agin].  We  now intend  to represent  other
data  structures,  control  structures,  and  strategies  for  visual
perception.    We  assume  that  we  can  choose  a  small  class  of
non-equivalent representations and that they are not  a collection of
ad hoc, unrelated structures, but reflect a common basis in 2D and 3D
geometry.    We  believe  that  effective  automatic  generation   of
strategies  is  simplified  by clear  semantics  for  primitives  and
interfaces. 

.oi
A long  range  goal is  to analyze  the  computational complexity  of
structures and operations.  The immediate goal is to characterize the
computation cost of primitives and simple control structures  such as
searches, to evaluate effective  strategies.  This informal semantics
is  a basis for comparing similar tasks,  and programming one task in
analogy with another.   It  is also a  basis for comparing  different
programs and evaluating  experiments.  The semantics define a <vision
language>.  We see it as a means of simplifying programming, as a way
of building up a system based on accumulated work,  and as a means of
cooperation among various workers. 

.cb Accomplishments

Execution: Bolles has written a program which uses a training picture
to  characterize a curve  (the contrast across  it, its distinctness,
etc.) and is then able to locate a point (or segment) from that curve
in an  `unknown' picture containing  essentially the same  curve.  We
are  quite  familiar  with  other  features:  correlation  [Quam] and
[Hannah],  edge  operator  [Hueckel],  region  growing  [Yakimovsky],
texture [Bajcsy], and contouring [Baumgart].  

There  are   also  programs  available   for  interactive   input  of
two-dimensional   and  three-dimensional  models.     These  are  not
complete, but  they are useful now.   The two-dimensional  system
allows a  user to  point out important  features such  as correlation
points,  curves,  and regions.    Essentially the  system  displays a
picture on the screen and  the user can `draw' on top of  it, marking
important features.  The three-dimensional program allows the user to
specify objects composed of parts (possibly unions, intersections, or
subtractions).  Each part  has a spine along which  there are several
cross-sections.   A cross-section may  be any non-intersecting  closed
curve.   Thus,  a  shaft  is represented  as  a  straight  spine with
circular cross-sections.   A cube is represented as  a straight spine
with square cross-sections.  Curves (such as circles) are approximated
by lines when they are displayed.  

Planning: We have used models within the  blocks world to predict the
scene to be analyzed for visual feedback: [Gill] and [Perkins].  With
respect to real world planning, Binford has supervised the thesis  of
Garvey at SRI on visual strategies for office scenes.

.cb Execution Program for Visual Feedback

We will use a  set of primitive operators which  already exist (curve
matching,  correlation, contours)  to locate  features from  a model.
Humans have difficulty  guiding assemblies without  stereo.  We  will
use stereo to measure spatial misalignment.   The world model will be
a graph whose links are perceptual operations and whose nodes are the
symbolic outputs of  those operators.   An abbreviated evaluation  of
the model  graph at runtime will allow  alternatives for contingency.
Either the  user or  the  planning system  for visual  feedback  will
provide effective  strategies.

For example, in the  screw-in-hole problem, the hole  will be located
by correlating with a small area centered on the hole in the training
image.  It  is costly to search  by correlation for  the hole over  a
large portion of  an image, and prone  to error.  The  hole is small;
this  is like searching for a needle in  a haystack.  Instead, a long
curve may  be inexpensive  to find.   A scan  with the  edge operator
along  a single line  intersects the  curve.   A few  additional edge
vectors confirm that  the model  curve predicts the  curve which  was
found.   Once  the curve  is found,  it  is possible  to predict  the
location  of  the hole  or  inexpensively locate  another  feature to
predict the location of the hole.  Then correlation provides  a final
precise location. 
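
The strategy can be made concrete by a toy one-dimensional sketch
(Python notation; the image values, the 12-pixel model offset, and
the thresholds are all invented for illustration):

.bc;verbatim;
row = [10] * 40 + [90] * 60                # a long step edge at column 40
row[52] = 30                               # the "hole": a small dark feature

def find_edge(row, threshold=40):
    """One cheap scan across the image finds the long curve."""
    for x in range(1, len(row)):
        if abs(row[x] - row[x - 1]) > threshold:
            return x
    return None

edge = find_edge(row)                      # 40
predicted = edge + 12                      # model: hole lies 12 pixels inside
window = range(predicted - 3, predicted + 4)
hole = min(window, key=lambda x: row[x])   # tiny local search, not a global one
print(edge, hole)                          # 40 52
.end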

What are the limitations of our planned program for visual feedback?
.bc;
	weak descriptors for texture
	limited to small angular and position shifts; almost 2D.
.end
Descriptive  primitives for  texture  are  among the  most  important
problems. The work  of Bajcsy [Bajcsy, Lieberman] is perhaps the best,
but enormous work  remains to be  done.  The use  of spatial features
obtained using  stereo is common to all  our planned vision work, and
works well with texture, although it is costly. 

.cb Planning Program
It is possible to input the planning model in three different ways:
.bc;
	manual 2D input
	manual 3D input
	automatic 2D and 3D input.
.end
The  first is  a  manual  2D mode,  with  the user  outlining  region
features  under keyboard  control of  a cursor,  and the  user making
associations between the 2D  and 3D models.   In the second mode,  3D
models will be  input in our representation for objects.   A 2D model
will  be generated  using computer  graphics.   In the third  mode, a
program for  automatic input  from  2D images  will characterize  the
image in terms of  stereo and region boundaries, and make association
with the 3D model. The first two exist in usable forms.

To generate  execution programs,  goals (locate  hole, locate  screw)
will be translated into a graph of model states linked by primitives. 
A relaxation of  the graph  will determine an  efficient subgraph  to
attain the goals of locating parts. 
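
One way to realize such a relaxation is suggested by the following
sketch (Python notation; the graph, the costs, and the operator names
are invented), which selects the cheapest chain of primitives by a
shortest-path computation over the model graph:

.bc;verbatim;
import heapq

LINKS = {
    "start":       [("hole_located", 9.0, "correlate over the whole image"),
                    ("curve_found",  1.0, "edge scan for the long curve")],
    "curve_found": [("hole_located", 1.5, "predict hole, correlate locally")],
}

def cheapest_plan(start, goal):
    """Shortest-path relaxation over the model graph."""
    frontier = [(0.0, start, [])]
    done = set()
    while frontier:
        cost, state, plan = heapq.heappop(frontier)
        if state == goal:
            return cost, plan
        if state in done:
            continue
        done.add(state)
        for nxt, c, op in LINKS.get(state, []):
            heapq.heappush(frontier, (cost + c, nxt, plan + [op]))
    return None

print(cheapest_plan("start", "hole_located"))
# (2.5, [...]): the two-step route beats brute-force correlation (9.0)
.end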

.ss Planning Program Input Sub-objective

I.  Automatically  construct  a  2D  model  of  an  image  (screw  in
distributor  case)   to  simplify  man-machine  interface  in  visual
feedback programming. 

II. Generate a  3D model of an assembly  from a sample scene,  with a
simple symbolic model as a guide. 

.cb Accomplishments
Nevatia implemented a system  which recognized a doll, a toy  horse, a
glove, and several other objects from a  structured visual  memory [Nevatia].   The system
matched descriptions of objects against models of objects it had seen
before.  The  system made up its own  models, which were descriptions
(sometimes  modified by  humans) of previously  seen objects.   For a
large  visual  memory,  it   is  unreasonable  to  match   an  object
description  against all  models  in memory.   A  beginning  was made
toward recognition with  a large visual  memory, although only  about
six models  were  used.   Models were  indexed  according to  summary
descriptions of  the object shape.  Only  models with similar summary
descriptions were compared in detail. 
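
The indexing scheme can be suggested by a small sketch (Python
notation; the summary descriptors and the model names are invented
stand-ins for Nevatia's actual shape summaries):

.bc;verbatim;
MEMORY = {
    ("4 arms", "elongated"): ["doll", "toy horse"],
    ("5 arms", "flat"):      ["glove"],
    ("1 arm",  "elongated"): ["hammer", "shaft"],
}

def recognize(summary, detailed_match):
    candidates = MEMORY.get(summary, [])    # cheap lookup by coarse summary
    for model in candidates:                # costly matching on a few models only
        if detailed_match(model):
            return model
    return None

print(recognize(("5 arms", "flat"), lambda model: model == "glove"))   # glove
.end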

Descriptions follow  a representation of  shape based  on generalized
cone  primitives  [Binford].   In  this  representation,  objects are
described in terms of parts.   For example, a human has a  body, four
armlike  projections  and  another projection  which  isn't  extended
(head).  The primitive parts of the representation are armlike parts
defined  by smoothly  varying  cross  sections along  a  space  curve
(generalized translational invariance). 

Parts are described by the axis and cross section.  The original data
is   three-dimensional,  although  the   description  techniques  are
valuable  for TV  data.    Agin  built  a  laser  ranging  system  and
implemented  a  preliminary  version   of  part  description  [Agin].
Nevatia's programs use depth data from the laser ranging system, find
boundaries of continuous surfaces, make descriptions of armlike parts
and piece together parts into complete descriptions [Nevatia]. 

.cb Plan
We plan to  automate the building of  the model used by  the strategy
module  in visual feedback.   Now  the programmer builds  the model.
The final result of  this will be that  to program a visual  feedback
task will require only putting down an example of an assembly and
supplying  a task  statement (put  the sleeve  over the  shaft).   The
process   will  be   equally   automated   for  objects   for   which
computer-aided design models are available. 

To facilitate  setup, stereo  will make  it possible  to build  space
models of  parts to be assembled.   This will allow  us to extend the
capabilities of execution  programs for  visual feedback to  assembly
with large position  variations (10 cm) or  large angular variations.
In imagining applications of vision to assembly, we immediately think
of very  constrained,  repetitive  visual feedback  tasks.   However,
picking parts  from a bin  is a common  industrial assembly subtask. 
The  work   on  visual   feedback  provides   modules   (description,
representation, strategies) which make possible  more powerful visual
systems.   We will  use a  combination of depth  discontinuities from
stereo and/or a laser ranging device, and color region boundaries.  We
will  describe  parts  of  region boundaries  using  techniques  from
Nevatia  [Nevatia].    We will  build  up  spatial  descriptions from
boundary information and 2D cues.  We will separate objects based on
spatial  boundaries  and  additional  segmentation  at  places  where
objects touch (objects on the table, for example).  We will recognize
objects  from   spatial  descriptions  of   parts  and   confirm  the
segmentation into objects. 

Region  boundary  techniques are  inadequate  because  they
threshold on a very local context.   Some global optimal curve search
techniques have  been developed [Martelli, Chien], but these are much
too special   for  our purposes,  and computationally  expensive.
Our plan  is to  limit computation cost  by 1) decreasing the  number of
possible curves by limiting to  smooth curves, and 2) by cascading  sums
to form larger support.   We will combine the output of  two or more
Hueckel type  operators [Hueckel]  and threshold  not on the  Hueckel
disk,  but over a larger  support.  We  will analyze the  theory of a
locally optimal curve technique, determine the computation  cost, and
implement it if warranted. 
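
The following sketch suggests what cascading sums over smooth curves
means computationally (Python notation; the response array stands in
for Hueckel-type operator outputs, and the smoothness rule of one
column of drift per row is an invented simplification):

.bc;verbatim;
def best_smooth_curve(resp):
    """Sum responses down the rows, letting the curve drift one column per row."""
    cols = len(resp[0])
    score = list(resp[0])                       # running sums, one per column
    for row in resp[1:]:
        score = [row[c] + max(score[max(0, c - 1):c + 2])
                 for c in range(cols)]
    return max(score)                           # support of the best smooth curve

responses = [                                   # stand-ins for edge operator outputs
    [0.1, 0.9, 0.2, 0.1],
    [0.2, 0.8, 0.9, 0.1],
    [0.1, 0.2, 0.9, 0.2],
]
total = best_smooth_curve(responses)
print(round(total, 2), total > 2.0)             # 2.7 True: threshold the summed support
.end

No single local response here exceeds 0.9, yet the summed support
along the smooth curve passes a threshold no local decision could.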

.cb Milestones

We  will complete  a  system  for  verification vision  including  an
execution module, a  2D world model, and a planning system by January
1977.  The execution  system will include stereo.   We will test  the
system by  visual control of  insertion of a screw  into a hole  in a
distributor body by January 1976, and put a sleeve over a shaft by July 1976.
  We will
have both feature-based  and area-based stereo  in use by July  1976.
We will implement the  model graph and evaluation programs to provide
automatic generation of programs for  screw in hole by January  1976.
We will have  3D input with 2D  image input from graphics  by January
1977. 
We  will  have a  vision  language  with  2D model  and  model  graph
facilities by July 1976. 

We will make  the theoretical analysis  of the optimal  curve search,
and  determine the computation  cost by July  1976.  We  will have an
automatic 2D model generation program by July 1976.  The basic higher
level perceptual program structure will be developed for the planning
program for visual feedback. 

.ss Analysis and Modelling of Natural Scenes

We propose to continue research in computer analysis of natural
scenes such as rugged terrain, desert, roads, and streets with
emphasis on applications to vehicle guidance.  A practical solution
to the problem of navigating a vehicle through an unknown environment
requires the ability to incrementally acquire models for newly
experienced parts of the world, to verify and refine previous world
models, and to detect obstacles and moving objects. 

Of primary importance are geometric models of the world, which enable
the selection of navigation routes satisfying constraints such as
surface roughness and slopes. 

Photometric properties such as color, reflectivity, and texture will
be used along with geometric models to provide better scene
description and segmentation. 

.cb Approach

We propose to analyze multiple views of the world, both conventional
left-right stereo and motion parallax.  Pairs of images along with
geometric models for the camera position and orientation are adequate
to generate a 3D model for the portions of the scene common to the
pairs of views. 
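
The underlying geometry is ordinary triangulation; the following
sketch (Python notation, with invented numbers) uses a simplified
rectified-stereo model with parallel camera axes and a known baseline
in place of the full camera position and orientation solution:

.bc;verbatim;
def triangulate(x_left, x_right, y, focal, baseline):
    """Depth from disparity for rectified stereo (feature on the same row)."""
    disparity = x_left - x_right
    z = focal * baseline / disparity           # depth along the optical axis
    return (x_left * z / focal,                # back-project into 3-space
            y * z / focal,
            z)

# A feature at column 120 in the left image and 100 in the right, row 40,
# with a 500-pixel focal length and a 0.5 m baseline:
print(triangulate(120.0, 100.0, 40.0, 500.0, 0.5))   # (3.0, 1.0, 12.5)
.end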

.cb Accomplishments

We have a substantial foundation of experience and 
tools to build upon. Some of the major achievements are:
.bc;
  a. Parallax region analyzer [Hannah];
  b. Experimental automatic photogrammetry system [Quam];
  c. Visual feedback cart servo program (Quam, Moravec);
  d. GEOMED 3-dimensional modelling [Baumgart].
.end

.cb Types of Scenes and Images

We will digitize multiple views of a diverse collection of complex
natural scenes.  Since this is primarily a geometric modelling
experiment, we will need measurements to define the location and
orientation of the camera for each view.  These scenes are chosen to
cover a broad range of vehicle guidance environments.  High
resolution is important.  We would like at least 1200x800 pixel
digitization resolution, with 8 bits per primary color.



.ss Hardware plans

Our approach to hardware is: how can we get the maximum of research
from a minimum of system and hardware effort?  Some considerations
are:
.bs;fac;
(1) vision is limited by computer speed, not by input device
speed,

(2) we prefer to buy commercial devices and interfaces rather than
develop our own,

(3) it is adequate, usually preferable, to work from disk images
for 95% of our work,

(4) we do not really require computer control of all camera facilities
and can wait until we have firm immediate requirements before
worrying about that,

(5) it is extremely valuable to have color image output.
.end
.hehard:
.cb Hand-held Camera

For visual feedback in assembly, it is planned to have a camera which
one hand can bring to the work area.  We propose to purchase either a
small TV camera or a 100x100 solid state array from GE or Fairchild.
The solid  state camera  is much  smaller.   We may  wait a  year for
higher resolution  solid state cameras.  The price will be higher.  A
unibus or mapping bus interface is adequate for  these devices, which
have  low data  rate.   The other  considerations are  mechanical and
computer interfacing of pan/tilt, etc.  We will not do any  pan/tilt,
since it  can be just  picked up  by the hand.   The cost,  including
interface, will be about $6,000. 

.cb Stereo Camera System

All of  our vision projects  rely extensively on  stereo.   A stable,
well-designed   stereo  configuration   is  needed   which  maintains
calibration. 
The configuration  depends upon  a design study.  We have not found a
suitable commercial system.  The
cost will be about $15,000. 

.bib

[Agin] Agin, Gerald J., Thomas O. Binford, "Computer Description of
Curved Objects", <Proc. Third International Joint
Conf. on Artificial Intelligence>, Stanford University, August
1973.

[Agin] Agin, G. J., "Representation and Description of Curved Objects",
Stanford Artificial Intelligence Project Memo No. 173, October 1972.

[Ambler] Ambler, A. P., H. G. Barrow, C. M. Brown, R. M. Burstall,
R. J. Popplestone, "A Versatile Computer-Controlled Assembly System",
Dept. of Machine Intelligence, University of Edinburgh.

[Ambler and Popplestone]  A. P. Ambler and R. J. Popplestone, "Inferring
the Positions of Bodies from Specified Spatial Relationships",  manuscript,
Dept. of Machine Intelligence, University of Edinburgh.

[Bajcsy] Bajcsy, Ruzena, "Computer Description of Textured Scenes", <Proc.
Third Int. Joint Conf. on Artificial Intelligence>, Stanford U.,
1973.

[Baumgart] Bruce G. Baumgart, "GEOMED - A Geometric Editor",
AIM-232, May 1974.

[Baumgart] Bruce G. Baumgart, "Geometric Modeling for Computer Vision",
AIM-249, October 1974.

[Binford] T.O. Binford, "Visual  Perception  by  Computer", Invited paper
at <IEEE Systems Science and Cybernetics,> Miami, December 1971.

[Bolles] Bolles, R. C. and Paul, R., "The Use of Sensory Feedback in a
Programmable Assembly System", Stanford Artificial Intelligence Project
Memo No. 220, October 1973.

[Finkel]
Raphael Finkel, Russell Taylor, Robert Bolles, Richard Paul, Jerome Feldman,
"AL, A Programming System for Automation", AIM-243, November 1974.

[Gill] Gill, A., "Visual Feedback and Related Problems in Computer
Controlled Hand-Eye Coordination", Stanford Artificial Intelligence
Project Memo No. 178, October 1972.

[Goto] T. Goto, T. Inoyama and K. Takeyasu, "Precise Insert Operation by
Tactile Controlled Robot "HI-T-HAND EXPERT-2"", Proceedings 4th
International Conference on Industrial Robots, p. 209, 1974.

[Hannah] Marsha Jo Hannah, "Computer Matching of Areas in Stereo Images",
<Ph.D. Thesis in Computer Science>, AIM-239, July 1974.

[Hueckel] M.H. Hueckel, "An Operator Which Locates Edges in Digitized
Pictures", AIM-105, December 1969; also in JACM, Vol. 18, No. 1, January 1971.

[Inoue] H. Inoue, "Force Feedback in Precise Assembly Tasks", MIT AI
Memo.

[Lieberman] Lawrence Lieberman, "Computer Recognition and Description of
Natural Scenes", PhD Dissertation, Univ of Pennsylvania, 1974

[Luckham]  David  Luckham  and  Jack  Buchanan,  "Automatic Generation of
Programs  Containing Conditional Statements", <Proc.
A.I.S.B. Summer Conference,> Sussex, England, July 1974.

[Nakano] E. Nakano, X. Ozaki, T. Ishida, I. Kato, "Cooperational Control
of the Anthropomorphous Manipulator "MELARM"", Proc. 4th International
Conference on Industrial Robots, p. 251, 1974.

[Nevatia] R.K. Nevatia and T.O. Binford, "Structured Descriptions of
Complex Objects", <Third Int. Joint Conf. on AI>, Stanford, Calif, 1973.

[Paul] R. Paul, "Modelling, Trajectory Calculation and Servoing of  a
Computer  Controlled  Arm", <Ph.D. Thesis in Computer Science,>
AIM-177, September 1972.

[Perkins] Walton A. Perkins, Thomas O. Binford.,
"A Corner Finder for Visual Feedback", AIM-214, September 1973.

[Quam] Quam, Lynn H., "Computer Comparison of Pictures", Stanford
Artificial Intelligence Project Memo No. 144.

[Quam] Lynn Quam, Marsha Jo Hannah, "Stanford Automatic Photogrammetry Research",
AIM-254, November 1974.

[Sacerdoti] Earl D. Sacerdoti, "The Nonlinear Nature of Plans".  Stanford
	Research Institute Artificial Intelligence Group Technical Note 101,
	January, 1975.

[Scheinman] V. D. Scheinman, Design  of  a  Computer  Manipulator,
Stanford Artificial Intelligence Project, Memo AIM-92, June 1969.

[Yakimovsky] Yakimovsky, Y., "Scene Analysis using a Semantic Base for
Region Growing", Stanford Artificial Intelligence Project Memo No. 209.

[Yakimovsky] Yakimovsky, Y. and Feldman, J., "A Semantics-based Decision
Theory Region Analyzer", Proceedings of the Third International Joint
Conference on Artificial Intelligence, Stanford, 1973.

.CB FILMS

⊗ Richard Paul and Karl Pingle, "Instant Insanity", 16mm color, silent,
6 min, August 1971.

⊗ Pingle, Paul and Bolles, "Automated Pump Assembly", 16mm color,
silent, 7 min, April 1973.

⊗ Pingle, Paul and  Bolles,  "Automated  Assembly, Three  Short  Examples",
16mm color, sound, November 1974.

.end